{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "copyright"
      },
      "outputs": [],
      "source": [
        "# Copyright 2020 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "title"
      },
      "source": [
        "# Vertex client library for Python: Custom training tabular regression model for batch prediction with explanation\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td>\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/explainable_ai/gapic-custom_tabular_regression_batch_explain.ipynb\">\n",
        "      <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/explainable_ai/gapic-custom_tabular_regression_batch_explain.ipynb\">\n",
        "      <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
        "      View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "<br/><br/><br/>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "overview:custom,xai"
      },
      "source": [
        "## Overview\n",
        "\n",
        "\n",
        "This tutorial demonstrates how to use the Vertex client library for Python to train and deploy a custom tabular regression model for batch prediction with explanation."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dataset:custom,boston,lrg"
      },
      "source": [
        "### Dataset\n",
        "\n",
        "The dataset used for this tutorial is the [Boston Housing Prices dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). The version of the dataset you will use in this tutorial is built into TensorFlow. The trained model predicts the median price of a house in units of 1K USD."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "objective:custom,training,batch_prediction,xai"
      },
      "source": [
        "### Objective\n",
        "\n",
        "In this tutorial, you create a custom trained model, with a training pipeline, from a Python script in a Google prebuilt Docker container using the Vertex client library for Python, and then do a batch prediction with explanations on the uploaded model. Alternatively, you can create custom trained models using `gcloud` command-line tool or online using Cloud Console.\n",
        "\n",
        "The steps performed include:\n",
        "\n",
        "- Create a Vertex custom job for training a model.\n",
        "- Train the TensorFlow model.\n",
        "- Retrieve and load the model artifacts.\n",
        "- View the model evaluation.\n",
        "- Set explanation parameters.\n",
        "- Upload the model as a Vertex `Model` resource.\n",
        "- Make a batch prediction with explanations."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "costs"
      },
      "source": [
        "### Costs\n",
        "\n",
        "This tutorial uses billable components of Google Cloud (GCP):\n",
        "\n",
        "* Vertex AI\n",
        "* Cloud Storage\n",
        "\n",
        "Learn about [Vertex AI\n",
        "pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage\n",
        "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n",
        "Calculator](https://cloud.google.com/products/calculator/)\n",
        "to generate a cost estimate based on your projected usage."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "install_aip"
      },
      "source": [
        "## Installation\n",
        "\n",
        "Install the latest version of Vertex client library for Python."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "install_aip"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "import sys\n",
        "\n",
        "# Google Cloud Notebook\n",
        "if os.path.exists(\"/opt/deeplearning/metadata/env_version\"):\n",
        "    USER_FLAG = \"--user\"\n",
        "else:\n",
        "    USER_FLAG = \"\"\n",
        "\n",
        "! pip install --upgrade google-cloud-aiplatform $USER_FLAG"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "install_storage"
      },
      "source": [
        "Install the latest GA version of *google-cloud-storage* library as well."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "install_storage"
      },
      "outputs": [],
      "source": [
        "! pip install -U google-cloud-storage $USER_FLAG"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "install_other_packages"
      },
      "source": [
        "### Install other packages\n",
        "\n",
        "Install other packages required for this tutorial."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "install_tabulate"
      },
      "outputs": [],
      "source": [
        "! pip install -U tabulate $USER_FLAG"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "restart"
      },
      "source": [
        "### Restart the kernel\n",
        "\n",
        "Once you've installed the Vertex client library for Python and *google-cloud-storage* library, you need to restart the notebook kernel so it can find the packages."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "restart"
      },
      "outputs": [],
      "source": [
        "if not os.getenv(\"IS_TESTING\"):\n",
        "    # Automatically restart kernel after installs\n",
        "    import IPython\n",
        "\n",
        "    app = IPython.Application.instance()\n",
        "    app.kernel.do_shutdown(True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "before_you_begin"
      },
      "source": [
        "## Before you begin\n",
        "\n",
        "### GPU runtime\n",
        "\n",
        "*Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select* **Runtime > Change Runtime Type > GPU**\n",
        "\n",
        "### Set up your Google Cloud project\n",
        "\n",
        "**The following steps are required, regardless of your notebook environment.**\n",
        "\n",
        "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
        "\n",
        "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
        "\n",
        "3. [Enable the Vertex AI APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n",
        "\n",
        "4. [The Cloud SDK](https://cloud.google.com/sdk) is already installed in Google Cloud Notebook.\n",
        "\n",
        "5. Enter your project ID in the cell below. Then run the  cell to make sure the\n",
        "Cloud SDK uses the right project for all the commands in this notebook.\n",
        "\n",
        "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "set_project_id"
      },
      "outputs": [],
      "source": [
        "PROJECT_ID = \"[your-project-id]\"  # @param {type:\"string\"}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "autoset_project_id"
      },
      "outputs": [],
      "source": [
        "if PROJECT_ID == \"\" or PROJECT_ID is None or PROJECT_ID == \"[your-project-id]\":\n",
        "    # Get your GCP project id from gcloud\n",
        "    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null\n",
        "    PROJECT_ID = shell_output[0]\n",
        "    print(\"Project ID:\", PROJECT_ID)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "set_gcloud_project_id"
      },
      "outputs": [],
      "source": [
        "! gcloud config set project $PROJECT_ID"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "region"
      },
      "source": [
        "#### Region\n",
        "\n",
        "You can also change the `REGION` variable, which is used for operations\n",
        "throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.\n",
        "\n",
        "- Americas: `us-central1`\n",
        "- Europe: `europe-west4`\n",
        "- Asia Pacific: `asia-east1`\n",
        "\n",
        "You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.\n",
        "\n",
        "Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "region"
      },
      "outputs": [],
      "source": [
        "REGION = \"us-central1\"  # @param {type: \"string\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "timestamp"
      },
      "source": [
        "#### Timestamp\n",
        "\n",
        "If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "timestamp"
      },
      "outputs": [],
      "source": [
        "from datetime import datetime\n",
        "\n",
        "TIMESTAMP = datetime.now().strftime(\"%Y%m%d%H%M%S\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gcp_authenticate"
      },
      "source": [
        "### Authenticate your Google Cloud account\n",
        "\n",
        "**If you are using Google Cloud Notebook**, your environment is already authenticated. Skip this step.\n",
        "\n",
        "**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via oAuth.\n",
        "\n",
        "**Otherwise**, follow these steps:\n",
        "\n",
        "In the Cloud Console, go to the [Create service account key](https://console.cloud.google.com/apis/credentials/serviceaccountkey) page.\n",
        "\n",
        "**Click Create service account**.\n",
        "\n",
        "In the **Service account name** field, enter a name, and click **Create**.\n",
        "\n",
        "In the **Grant this service account access to project** section, click the Role drop-down list. Type \"Vertex\" into the filter box, and select **Vertex AI Administrator**. Type \"Storage Object Admin\" into the filter box, and select **Storage Object Admin**.\n",
        "\n",
        "Click Create. A JSON file that contains your key downloads to your local environment.\n",
        "\n",
        "Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gcp_authenticate"
      },
      "outputs": [],
      "source": [
        "# If you are running this notebook in Colab, run this cell and follow the\n",
        "# instructions to authenticate your Google Cloud account. This provides access to your\n",
        "# Cloud Storage bucket and lets you submit training jobs and prediction\n",
        "# requests.\n",
        "\n",
        "# If on Google Cloud Notebook, then don't execute this code\n",
        "if not os.path.exists(\"/opt/deeplearning/metadata/env_version\"):\n",
        "    if \"google.colab\" in sys.modules:\n",
        "        from google.colab import auth as google_auth\n",
        "\n",
        "        google_auth.authenticate_user()\n",
        "\n",
        "    # If you are running this notebook locally, replace the string below with the\n",
        "    # path to your service account key and run this cell to authenticate your Google Cloud\n",
        "    # account.\n",
        "    elif not os.getenv(\"IS_TESTING\"):\n",
        "        %env GOOGLE_APPLICATION_CREDENTIALS ''"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bucket:custom"
      },
      "source": [
        "### Create a Cloud Storage bucket\n",
        "\n",
        "**The following steps are required, regardless of your notebook environment.**\n",
        "\n",
        "When you submit a custom training job using the Vertex client library for Python, you upload a Python package\n",
        "containing your training code to a Cloud Storage bucket. Vertex runs\n",
        "the code from this package. In this tutorial, Vertex also saves the\n",
        "trained model that results from your job in the same bucket. You can then\n",
        "create an `Endpoint` resource based on this output in order to serve\n",
        "online predictions.\n",
        "\n",
        "Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bucket"
      },
      "outputs": [],
      "source": [
        "BUCKET_NAME = \"gs://[your-bucket-name]\"  # @param {type:\"string\"}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "autoset_bucket"
      },
      "outputs": [],
      "source": [
        "if BUCKET_NAME == \"\" or BUCKET_NAME is None or BUCKET_NAME == \"gs://[your-bucket-name]\":\n",
        "    BUCKET_NAME = \"gs://\" + PROJECT_ID + \"aip-\" + TIMESTAMP"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "create_bucket"
      },
      "source": [
        "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "create_bucket"
      },
      "outputs": [],
      "source": [
        "! gsutil mb -l $REGION $BUCKET_NAME"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "validate_bucket"
      },
      "source": [
        "Finally, validate access to your Cloud Storage bucket by examining its contents:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "validate_bucket"
      },
      "outputs": [],
      "source": [
        "! gsutil ls -al $BUCKET_NAME"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "setup_vars"
      },
      "source": [
        "### Set up variables\n",
        "\n",
        "Next, set up some variables used throughout the tutorial.\n",
        "### Import libraries and define constants"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "import_aip:v1beta1,protobuf"
      },
      "source": [
        "#### Import Vertex client library for Python\n",
        "\n",
        "Import the Vertex client library for Python into your Python environment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "import_aip:v1beta1,protobuf"
      },
      "outputs": [],
      "source": [
        "import time\n",
        "\n",
        "import google.cloud.aiplatform_v1beta1 as aip\n",
        "from google.protobuf import json_format\n",
        "from google.protobuf.struct_pb2 import Value"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aip_constants"
      },
      "source": [
        "#### Vertex AI constants\n",
        "\n",
        "Setup up the following constants for Vertex AI:\n",
        "\n",
        "- `API_ENDPOINT`: The Vertex AI service endpoint for dataset, model, job, pipeline and endpoint services.\n",
        "- `PARENT`: The Vertex AI location root path for `Dataset`, `Model`, `Job`, `Pipeline` and `Endpoint` resources."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aip_constants"
      },
      "outputs": [],
      "source": [
        "# API service endpoint\n",
        "API_ENDPOINT = \"{}-aiplatform.googleapis.com\".format(REGION)\n",
        "\n",
        "# Vertex AI location root path for your dataset, model and endpoint resources\n",
        "PARENT = \"projects/\" + PROJECT_ID + \"/locations/\" + REGION"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "accelerators:training,prediction,cpu"
      },
      "source": [
        "#### Set hardware accelerators\n",
        "\n",
        "You can set hardware accelerators for training and prediction.\n",
        "\n",
        "Set the variables `TRAIN_GPU/TRAIN_NGPU` and `DEPLOY_GPU/DEPLOY_NGPU` to use a container image supporting a GPU and the number of GPUs allocated to the virtual machine (VM) instance. For example, to use a GPU container image with 4 Nvidia Telsa K80 GPUs allocated to each VM, you would specify:\n",
        "\n",
        "    (aip.AcceleratorType.NVIDIA_TESLA_K80, 4)\n",
        "\n",
        "\n",
        "Otherwise specify `(None, None)` to use a container image to run on a CPU.\n",
        "\n",
        "Learn [which accelerators are available in your region](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators).\n",
        "\n",
        "*Note*: TF releases before 2.3 for GPU support will fail to load the custom trained model in this tutorial. It is a known issue and fixed in TF 2.3 -- which is caused by static graph ops that are generated in the serving function. If you encounter this issue on your own custom trained models, use a container image for TF 2.3 with GPU support."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "accelerators:training,prediction,cpu"
      },
      "outputs": [],
      "source": [
        "if os.getenv(\"IS_TESTING_TRAIN_GPU\"):\n",
        "    TRAIN_GPU, TRAIN_NGPU = (\n",
        "        aip.AcceleratorType.NVIDIA_TESLA_K80,\n",
        "        int(os.getenv(\"IS_TESTING_TRAIN_GPU\")),\n",
        "    )\n",
        "else:\n",
        "    TRAIN_GPU, TRAIN_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_K80, 1)\n",
        "\n",
        "if os.getenv(\"IS_TESTING_DEPLOY_GPU\"):\n",
        "    DEPLOY_GPU, DEPLOY_NGPU = (\n",
        "        aip.AcceleratorType.NVIDIA_TESLA_K80,\n",
        "        int(os.getenv(\"IS_TESTING_DEPLOY_GPU\")),\n",
        "    )\n",
        "else:\n",
        "    DEPLOY_GPU, DEPLOY_NGPU = (None, None)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "container:training,prediction"
      },
      "source": [
        "#### Set pre-built containers\n",
        "\n",
        "Set the pre-built Docker container image for training and prediction.\n",
        "\n",
        "\n",
        "For the latest list, see [Pre-built containers for training](https://cloud.google.com/ai-platform-unified/docs/training/pre-built-containers).\n",
        "\n",
        "\n",
        "For the latest list, see [Pre-built containers for prediction](https://cloud.google.com/ai-platform-unified/docs/predictions/pre-built-containers)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "container:training,prediction"
      },
      "outputs": [],
      "source": [
        "if os.getenv(\"IS_TESTING_TF\"):\n",
        "    TF = os.getenv(\"IS_TESTING_TF\")\n",
        "else:\n",
        "    TF = \"2-1\"\n",
        "\n",
        "if TF[0] == \"2\":\n",
        "    if TRAIN_GPU:\n",
        "        TRAIN_VERSION = \"tf-gpu.{}\".format(TF)\n",
        "    else:\n",
        "        TRAIN_VERSION = \"tf-cpu.{}\".format(TF)\n",
        "    if DEPLOY_GPU:\n",
        "        DEPLOY_VERSION = \"tf2-gpu.{}\".format(TF)\n",
        "    else:\n",
        "        DEPLOY_VERSION = \"tf2-cpu.{}\".format(TF)\n",
        "else:\n",
        "    if TRAIN_GPU:\n",
        "        TRAIN_VERSION = \"tf-gpu.{}\".format(TF)\n",
        "    else:\n",
        "        TRAIN_VERSION = \"tf-cpu.{}\".format(TF)\n",
        "    if DEPLOY_GPU:\n",
        "        DEPLOY_VERSION = \"tf-gpu.{}\".format(TF)\n",
        "    else:\n",
        "        DEPLOY_VERSION = \"tf-cpu.{}\".format(TF)\n",
        "\n",
        "TRAIN_IMAGE = \"gcr.io/cloud-aiplatform/training/{}:latest\".format(TRAIN_VERSION)\n",
        "DEPLOY_IMAGE = \"gcr.io/cloud-aiplatform/prediction/{}:latest\".format(DEPLOY_VERSION)\n",
        "\n",
        "print(\"Training:\", TRAIN_IMAGE, TRAIN_GPU, TRAIN_NGPU)\n",
        "print(\"Deployment:\", DEPLOY_IMAGE, DEPLOY_GPU, DEPLOY_NGPU)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "machine:training,prediction"
      },
      "source": [
        "#### Set machine type\n",
        "\n",
        "Next, set the machine type to use for training and prediction.\n",
        "\n",
        "- Set the variables `TRAIN_COMPUTE` and `DEPLOY_COMPUTE` to configure  the compute resources for the VMs you will use for for training and prediction.\n",
        " - `machine type`\n",
        "     - `n1-standard`: 3.75GB of memory per vCPU.\n",
        "     - `n1-highmem`: 6.5GB of memory per vCPU\n",
        "     - `n1-highcpu`: 0.9 GB of memory per vCPU\n",
        " - `vCPUs`: number of \\[2, 4, 8, 16, 32, 64, 96 \\]\n",
        "\n",
        "*Note: The following is not supported for training:*\n",
        "\n",
        " - `standard`: 2 vCPUs\n",
        " - `highcpu`: 2, 4 and 8 vCPUs\n",
        "\n",
        "*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*.\n",
        "\n",
        "Learn [which machine types are available for training](https://cloud.google.com/vertex-ai/docs/training/configure-compute) and [for prediction](https://cloud.google.com/vertex-ai/docs/predictions/configure-compute)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "machine:training,prediction"
      },
      "outputs": [],
      "source": [
        "if os.getenv(\"IS_TESTING_TRAIN_MACHINE\"):\n",
        "    MACHINE_TYPE = os.getenv(\"IS_TESTING_TRAIN_MACHINE\")\n",
        "else:\n",
        "    MACHINE_TYPE = \"n1-standard\"\n",
        "\n",
        "VCPU = \"4\"\n",
        "TRAIN_COMPUTE = MACHINE_TYPE + \"-\" + VCPU\n",
        "print(\"Train machine type\", TRAIN_COMPUTE)\n",
        "\n",
        "if os.getenv(\"IS_TESTING_DEPLOY_MACHINE\"):\n",
        "    MACHINE_TYPE = os.getenv(\"IS_TESTING_DEPLOY_MACHINE\")\n",
        "else:\n",
        "    MACHINE_TYPE = \"n1-standard\"\n",
        "\n",
        "VCPU = \"4\"\n",
        "DEPLOY_COMPUTE = MACHINE_TYPE + \"-\" + VCPU\n",
        "print(\"Deploy machine type\", DEPLOY_COMPUTE)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tutorial_start:custom"
      },
      "source": [
        "# Tutorial\n",
        "\n",
        "Now you are ready to start creating your own custom model and training for Boston Housing."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "clients:custom"
      },
      "source": [
        "## Set up clients\n",
        "\n",
        "The Vertex client library for Python works as a client/server model. On your side (the Python script) you will create a client that sends requests and receives responses from the Vertex AI server.\n",
        "\n",
        "You will use different clients in this tutorial for different steps in the workflow. So set them all up upfront.\n",
        "\n",
        "- Model Service for `Model` resources.\n",
        "- Endpoint Service for deployment.\n",
        "- Job Service for batch jobs and custom training.\n",
        "- Prediction Service for serving."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "clients:custom"
      },
      "outputs": [],
      "source": [
        "# client options same for all services\n",
        "client_options = {\"api_endpoint\": API_ENDPOINT}\n",
        "\n",
        "\n",
        "def create_job_client():\n",
        "    client = aip.JobServiceClient(client_options=client_options)\n",
        "    return client\n",
        "\n",
        "\n",
        "def create_model_client():\n",
        "    client = aip.ModelServiceClient(client_options=client_options)\n",
        "    return client\n",
        "\n",
        "\n",
        "def create_endpoint_client():\n",
        "    client = aip.EndpointServiceClient(client_options=client_options)\n",
        "    return client\n",
        "\n",
        "\n",
        "def create_prediction_client():\n",
        "    client = aip.PredictionServiceClient(client_options=client_options)\n",
        "    return client\n",
        "\n",
        "\n",
        "clients = {}\n",
        "clients[\"job\"] = create_job_client()\n",
        "clients[\"model\"] = create_model_client()\n",
        "clients[\"endpoint\"] = create_endpoint_client()\n",
        "clients[\"prediction\"] = create_prediction_client()\n",
        "\n",
        "for client in clients.items():\n",
        "    print(client)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "train_custom_model"
      },
      "source": [
        "## Train a model\n",
        "\n",
        "There are two ways you can train a custom model using a container image:\n",
        "\n",
        "- **Use a Google Cloud prebuilt container**. If you use a prebuilt container, you will additionally specify a Python package to install into the container image. This Python package contains your code for training a custom model.\n",
        "\n",
        "- **Use your own custom container image**. If you use your own container, the container needs to contain your code for training a custom model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "train_custom_job_specification:prebuilt_container"
      },
      "source": [
        "## Prepare your custom job specification\n",
        "\n",
        "Now that your clients are ready, your first step is to create a Job Specification for your custom training job. The job specification will consist of the following:\n",
        "\n",
        "- `worker_pool_spec` : The specification of the type of machine(s) you will use for training and how many (single or distributed)\n",
        "- `python_package_spec` : The specification of the Python package to be installed with the pre-built container."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "train_custom_job_machine_specification"
      },
      "source": [
        "### Prepare your machine specification\n",
        "\n",
        "Now define the machine specification for your custom training job. This tells Vertex AI what type of machine instance to provision for the training.\n",
        "  - `machine_type`: The type of Google Cloud instance to provision -- e.g., n1-standard-8.\n",
        "  - `accelerator_type`: The type, if any, of hardware accelerator. In this tutorial if you previously set the variable `TRAIN_GPU != None`, you are using a GPU; otherwise you will use a CPU.\n",
        "  - `accelerator_count`: The number of accelerators."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "train_custom_job_machine_specification"
      },
      "outputs": [],
      "source": [
        "if TRAIN_GPU:\n",
        "    machine_spec = {\n",
        "        \"machine_type\": TRAIN_COMPUTE,\n",
        "        \"accelerator_type\": TRAIN_GPU,\n",
        "        \"accelerator_count\": TRAIN_NGPU,\n",
        "    }\n",
        "else:\n",
        "    machine_spec = {\"machine_type\": TRAIN_COMPUTE, \"accelerator_count\": 0}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "train_custom_job_disk_specification"
      },
      "source": [
        "### Prepare your disk specification\n",
        "\n",
        "(optional) Now define the disk specification for your custom training job. This tells Vertex AI what type and size of disk to provision in each machine instance for the training.\n",
        "\n",
        "  - `boot_disk_type`: Either SSD or Standard. SSD is faster, and Standard is less expensive. Defaults to SSD.\n",
        "  - `boot_disk_size_gb`: Size of disk in GB."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "train_custom_job_disk_specification"
      },
      "outputs": [],
      "source": [
        "DISK_TYPE = \"pd-ssd\"  # [ pd-ssd, pd-standard]\n",
        "DISK_SIZE = 200  # GB\n",
        "\n",
        "disk_spec = {\"boot_disk_type\": DISK_TYPE, \"boot_disk_size_gb\": DISK_SIZE}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "train_custom_job_worker_pool_specification:prebuilt_container"
      },
      "source": [
        "### Define the worker pool specification\n",
        "\n",
        "Next, you define the worker pool specification for your custom training job. The worker pool specification will consist of the following:\n",
        "\n",
        "- `replica_count`: The number of instances to provision of this machine type.\n",
        "- `machine_spec`: The hardware specification.\n",
        "- `disk_spec` : (optional) The disk storage specification.\n",
        "\n",
        "- `python_package`: The Python training package to install on the VM instance(s) and which Python module to invoke, along with command line arguments for the Python module.\n",
        "\n",
        "Let's dive deeper now into the python package specification:\n",
        "\n",
        "-`executor_image_spec`: This is the docker image which is configured for your custom training job.\n",
        "\n",
        "-`package_uris`: This is a list of the locations (URIs) of your python training packages to install on the provisioned instance. The locations need to be in a Cloud Storage bucket. These can be either individual python files or a zip (archive) of an entire package. In the later case, the job service will unzip (unarchive) the contents into the docker image.\n",
        "\n",
        "-`python_module`: The Python module (script) to invoke for running the custom training job. In this example, you will be invoking `trainer.task.py` -- note that it was not neccessary to append the `.py` suffix.\n",
        "\n",
        "-`args`: The command line arguments to pass to the corresponding Pythom module. In this example, you will be setting:\n",
        "  - `\"--model-dir=\" + MODEL_DIR` : The Cloud Storage location where to store the model artifacts. There are two ways to tell the training script where to save the model artifacts:\n",
        "      - direct: You pass the Cloud Storage location as a command line argument to your training script (set variable `DIRECT = True`), or\n",
        "      - indirect: The service passes the Cloud Storage location as the environment variable `AIP_MODEL_DIR` to your training script (set variable `DIRECT = False`). In this case, you tell the service the model artifact location in the job specification.\n",
        "  - `\"--epochs=\" + EPOCHS`: The number of epochs for training.\n",
        "  - `\"--steps=\" + STEPS`: The number of steps (batches) per epoch.\n",
        "  - `\"--distribute=\" + TRAIN_STRATEGY\"` : The training distribution strategy to use for single or distributed training.\n",
        "     - `\"single\"`: single device.\n",
        "     - `\"mirror\"`: all GPU devices on a single compute instance.\n",
        "     - `\"multi\"`: all GPU devices on all compute instances."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "train_custom_job_worker_pool_specification:prebuilt_container"
      },
      "outputs": [],
      "source": [
        "JOB_NAME = \"custom_job_\" + TIMESTAMP\n",
        "MODEL_DIR = \"{}/{}\".format(BUCKET_NAME, JOB_NAME)\n",
        "\n",
        "if not TRAIN_NGPU or TRAIN_NGPU < 2:\n",
        "    TRAIN_STRATEGY = \"single\"\n",
        "else:\n",
        "    TRAIN_STRATEGY = \"mirror\"\n",
        "\n",
        "EPOCHS = 20\n",
        "STEPS = 100\n",
        "\n",
        "DIRECT = True\n",
        "if DIRECT:\n",
        "    CMDARGS = [\n",
        "        \"--model-dir=\" + MODEL_DIR,\n",
        "        \"--epochs=\" + str(EPOCHS),\n",
        "        \"--steps=\" + str(STEPS),\n",
        "        \"--distribute=\" + TRAIN_STRATEGY,\n",
        "    ]\n",
        "else:\n",
        "    CMDARGS = [\n",
        "        \"--epochs=\" + str(EPOCHS),\n",
        "        \"--steps=\" + str(STEPS),\n",
        "        \"--distribute=\" + TRAIN_STRATEGY,\n",
        "    ]\n",
        "\n",
        "worker_pool_spec = [\n",
        "    {\n",
        "        \"replica_count\": 1,\n",
        "        \"machine_spec\": machine_spec,\n",
        "        \"disk_spec\": disk_spec,\n",
        "        \"python_package_spec\": {\n",
        "            \"executor_image_uri\": TRAIN_IMAGE,\n",
        "            \"package_uris\": [BUCKET_NAME + \"/trainer_boston.tar.gz\"],\n",
        "            \"python_module\": \"trainer.task\",\n",
        "            \"args\": CMDARGS,\n",
        "        },\n",
        "    }\n",
        "]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "assemble_custom_job_specification"
      },
      "source": [
        "### Assemble a job specification\n",
        "\n",
        "Now assemble the complete description for the custom job specification:\n",
        "\n",
        "- `display_name`: The human readable name you assign to this custom job.\n",
        "- `job_spec`: The specification for the custom job.\n",
        "    - `worker_pool_specs`: The specification for the machine VM instances.\n",
        "    - `base_output_directory`: This tells the service the Cloud Storage location where to save the model artifacts (when variable `DIRECT = False`). The service will then pass the location to the training script as the environment variable `AIP_MODEL_DIR`, and the path will be of the form:\n",
        "\n",
        "                <output_uri_prefix>/model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "assemble_custom_job_specification"
      },
      "outputs": [],
      "source": [
        "if DIRECT:\n",
        "    job_spec = {\"worker_pool_specs\": worker_pool_spec}\n",
        "else:\n",
        "    job_spec = {\n",
        "        \"worker_pool_specs\": worker_pool_spec,\n",
        "        \"base_output_directory\": {\"output_uri_prefix\": MODEL_DIR},\n",
        "    }\n",
        "\n",
        "custom_job = {\"display_name\": JOB_NAME, \"job_spec\": job_spec}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "examine_training_package"
      },
      "source": [
        "### Examine the training package\n",
        "\n",
        "#### Package layout\n",
        "\n",
        "Before you start the training, you will look at how a Python package is assembled for a custom training job. When unarchived, the package contains the following directory/file layout.\n",
        "\n",
        "- PKG-INFO\n",
        "- README.md\n",
        "- setup.cfg\n",
        "- setup.py\n",
        "- trainer\n",
        "  - \\_\\_init\\_\\_.py\n",
        "  - task.py\n",
        "\n",
        "The files `setup.cfg` and `setup.py` are the instructions for installing the package into the operating environment of the Docker image.\n",
        "\n",
        "The file `trainer/task.py` is the Python script for executing the custom training job. *Note*, when we referred to it in the worker pool specification, we replace the directory slash with a dot (`trainer.task`) and dropped the file suffix (`.py`).\n",
        "\n",
        "#### Package Assembly\n",
        "\n",
        "In the following cells, you will assemble the training package."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "examine_training_package"
      },
      "outputs": [],
      "source": [
        "# Make folder for Python training script\n",
        "! rm -rf custom\n",
        "! mkdir custom\n",
        "\n",
        "# Add package information\n",
        "! touch custom/README.md\n",
        "\n",
        "setup_cfg = \"[egg_info]\\n\\ntag_build =\\n\\ntag_date = 0\"\n",
        "! echo \"$setup_cfg\" > custom/setup.cfg\n",
        "\n",
        "setup_py = \"import setuptools\\n\\nsetuptools.setup(\\n\\n    install_requires=[\\n\\n        'tensorflow_datasets==1.3.0',\\n\\n    ],\\n\\n    packages=setuptools.find_packages())\"\n",
        "! echo \"$setup_py\" > custom/setup.py\n",
        "\n",
        "pkg_info = \"Metadata-Version: 1.0\\n\\nName: Boston Housing tabular regression\\n\\nVersion: 0.0.0\\n\\nSummary: Demostration training script\\n\\nHome-page: www.google.com\\n\\nAuthor: Google\\n\\nAuthor-email: aferlitsch@google.com\\n\\nLicense: Public\\n\\nDescription: Demo\\n\\nPlatform: Vertex\"\n",
        "! echo \"$pkg_info\" > custom/PKG-INFO\n",
        "\n",
        "# Make the training subfolder\n",
        "! mkdir custom/trainer\n",
        "! touch custom/trainer/__init__.py"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "taskpy_contents:boston"
      },
      "source": [
        "#### Task.py contents\n",
        "\n",
        "In the next cell, you write the contents of the training script task.py. I won't go into detail, it's just there for you to browse. In summary:\n",
        "\n",
        "- Get the directory where to save the model artifacts from the command line (`--model_dir`), and if not specified, then from the environment variable `AIP_MODEL_DIR`.\n",
        "- Loads Boston Housing dataset from TF.Keras builtin datasets\n",
        "- Builds a simple deep neural network model using TF.Keras model API.\n",
        "- Compiles the model (`compile()`).\n",
        "- Sets a training distribution strategy according to the argument `args.distribute`.\n",
        "- Trains the model (`fit()`) with epochs specified by `args.epochs`.\n",
        "- Saves the trained model (`save(args.model_dir)`) to the specified model directory.\n",
        "- Saves the maximum value for each feature `f.write(str(params))` to the specified parameters file."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "taskpy_contents:boston"
      },
      "outputs": [],
      "source": [
        "%%writefile custom/trainer/task.py\n",
        "# Single, Mirror and Multi-Machine Distributed Training for Boston Housing\n",
        "\n",
        "import tensorflow_datasets as tfds\n",
        "import tensorflow as tf\n",
        "from tensorflow.python.client import device_lib\n",
        "import numpy as np\n",
        "import argparse\n",
        "import os\n",
        "import sys\n",
        "tfds.disable_progress_bar()\n",
        "\n",
        "parser = argparse.ArgumentParser()\n",
        "parser.add_argument('--model-dir', dest='model_dir',\n",
        "                    default=os.getenv('AIP_MODEL_DIR'), type=str, help='Model dir.')\n",
        "parser.add_argument('--lr', dest='lr',\n",
        "                    default=0.001, type=float,\n",
        "                    help='Learning rate.')\n",
        "parser.add_argument('--epochs', dest='epochs',\n",
        "                    default=20, type=int,\n",
        "                    help='Number of epochs.')\n",
        "parser.add_argument('--steps', dest='steps',\n",
        "                    default=100, type=int,\n",
        "                    help='Number of steps per epoch.')\n",
        "parser.add_argument('--distribute', dest='distribute', type=str, default='single',\n",
        "                    help='distributed training strategy')\n",
        "parser.add_argument('--param-file', dest='param_file',\n",
        "                    default='/tmp/param.txt', type=str,\n",
        "                    help='Output file for parameters')\n",
        "args = parser.parse_args()\n",
        "\n",
        "print('Python Version = {}'.format(sys.version))\n",
        "print('TensorFlow Version = {}'.format(tf.__version__))\n",
        "print('TF_CONFIG = {}'.format(os.environ.get('TF_CONFIG', 'Not found')))\n",
        "\n",
        "# Single Machine, single compute device\n",
        "if args.distribute == 'single':\n",
        "    if tf.test.is_gpu_available():\n",
        "        strategy = tf.distribute.OneDeviceStrategy(device=\"/gpu:0\")\n",
        "    else:\n",
        "        strategy = tf.distribute.OneDeviceStrategy(device=\"/cpu:0\")\n",
        "# Single Machine, multiple compute device\n",
        "elif args.distribute == 'mirror':\n",
        "    strategy = tf.distribute.MirroredStrategy()\n",
        "# Multiple Machine, multiple compute device\n",
        "elif args.distribute == 'multi':\n",
        "    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()\n",
        "\n",
        "# Multi-worker configuration\n",
        "print('num_replicas_in_sync = {}'.format(strategy.num_replicas_in_sync))\n",
        "\n",
        "\n",
        "def make_dataset():\n",
        "\n",
        "  # Scaling Boston Housing data features\n",
        "  def scale(feature):\n",
        "    max = np.max(feature)\n",
        "    feature = (feature / max).astype(np.float)\n",
        "    return feature, max\n",
        "\n",
        "  (x_train, y_train), (x_test, y_test) = tf.keras.datasets.boston_housing.load_data(\n",
        "    path=\"boston_housing.npz\", test_split=0.2, seed=113\n",
        "  )\n",
        "  params = []\n",
        "  for _ in range(13):\n",
        "    x_train[_], max = scale(x_train[_])\n",
        "    x_test[_], _ = scale(x_test[_])\n",
        "    params.append(max)\n",
        "\n",
        "  # store the normalization (max) value for each feature\n",
        "  with tf.io.gfile.GFile(args.param_file, 'w') as f:\n",
        "    f.write(str(params))\n",
        "  return (x_train, y_train), (x_test, y_test)\n",
        "\n",
        "\n",
        "# Build the Keras model\n",
        "def build_and_compile_dnn_model():\n",
        "  model = tf.keras.Sequential([\n",
        "      tf.keras.layers.Dense(128, activation='relu', input_shape=(13,)),\n",
        "      tf.keras.layers.Dense(128, activation='relu'),\n",
        "      tf.keras.layers.Dense(1, activation='linear')\n",
        "  ])\n",
        "  model.compile(\n",
        "      loss='mse',\n",
        "      optimizer=tf.keras.optimizers.RMSprop(learning_rate=args.lr))\n",
        "  return model\n",
        "\n",
        "NUM_WORKERS = strategy.num_replicas_in_sync\n",
        "# Here the batch size scales up by number of workers since\n",
        "# `tf.data.Dataset.batch` expects the global batch size.\n",
        "BATCH_SIZE = 16\n",
        "GLOBAL_BATCH_SIZE = BATCH_SIZE * NUM_WORKERS\n",
        "\n",
        "with strategy.scope():\n",
        "  # Creation of dataset, and model building/compiling need to be within\n",
        "  # `strategy.scope()`.\n",
        "  model = build_and_compile_dnn_model()\n",
        "\n",
        "# Train the model\n",
        "(x_train, y_train), (x_test, y_test) = make_dataset()\n",
        "model.fit(x_train, y_train, epochs=args.epochs, batch_size=GLOBAL_BATCH_SIZE)\n",
        "model.save(args.model_dir)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tarball_training_script"
      },
      "source": [
        "#### Store training script on your Cloud Storage bucket\n",
        "\n",
        "Next, you package the training folder into a compressed tar ball, and then store it in your Cloud Storage bucket."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tarball_training_script"
      },
      "outputs": [],
      "source": [
        "! rm -f custom.tar custom.tar.gz\n",
        "! tar cvf custom.tar custom\n",
        "! gzip custom.tar\n",
        "! gsutil cp custom.tar.gz $BUCKET_NAME/trainer_boston.tar.gz"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "train_custom_job"
      },
      "source": [
        "### Train the model\n",
        "\n",
        "\n",
        "Now start the training of your custom training job on Vertex AI. Use this helper function `create_custom_job`, which takes the following parameter:\n",
        "\n",
        "-`custom_job`: The specification for the custom job.\n",
        "\n",
        "The helper function calls job client service's `create_custom_job` method, with the following parameters:\n",
        "\n",
        "-`parent`: The Vertex AI location path to `Dataset`, `Model` and `Endpoint` resources.\n",
        "-`custom_job`: The specification for the custom job.\n",
        "\n",
        "You will display a handful of the fields returned in `response` object, with the two that are of most interest are:\n",
        "\n",
        "-`response.name`: The Vertex AI fully qualified identifier assigned to this custom training job. You save this identifier for using in subsequent steps.\n",
        "\n",
        "-`response.state`: The current state of the custom training job."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "train_custom_job"
      },
      "outputs": [],
      "source": [
        "def create_custom_job(custom_job):\n",
        "    response = clients[\"job\"].create_custom_job(parent=PARENT, custom_job=custom_job)\n",
        "    print(\"name:\", response.name)\n",
        "    print(\"display_name:\", response.display_name)\n",
        "    print(\"state:\", response.state)\n",
        "    print(\"create_time:\", response.create_time)\n",
        "    print(\"update_time:\", response.update_time)\n",
        "    return response\n",
        "\n",
        "\n",
        "response = create_custom_job(custom_job)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "job_id:response"
      },
      "source": [
        "Now get the unique identifier for the custom job you created."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "job_id:response"
      },
      "outputs": [],
      "source": [
        "# The full unique ID for the custom job\n",
        "job_id = response.name\n",
        "# The short numeric ID for the custom job\n",
        "job_short_id = job_id.split(\"/\")[-1]\n",
        "\n",
        "print(job_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "get_custom_job"
      },
      "source": [
        "### Get information on a custom job\n",
        "\n",
        "Next, use this helper function `get_custom_job`, which takes the following parameter:\n",
        "\n",
        "- `name`: The Vertex AI fully qualified identifier for the custom job.\n",
        "\n",
        "The helper function calls the job client service's `get_custom_job` method, with the following parameter:\n",
        "\n",
        "- `name`: The Vertex AI fully qualified identifier for the custom job.\n",
        "\n",
        "If you recall, you got the Vertex AI fully qualified identifier for the custom job in the `response.name` field when you called the `create_custom_job` method, and saved the identifier in the variable `job_id`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "get_custom_job"
      },
      "outputs": [],
      "source": [
        "def get_custom_job(name, silent=False):\n",
        "    response = clients[\"job\"].get_custom_job(name=name)\n",
        "    if silent:\n",
        "        return response\n",
        "\n",
        "    print(\"name:\", response.name)\n",
        "    print(\"display_name:\", response.display_name)\n",
        "    print(\"state:\", response.state)\n",
        "    print(\"create_time:\", response.create_time)\n",
        "    print(\"update_time:\", response.update_time)\n",
        "    return response\n",
        "\n",
        "\n",
        "response = get_custom_job(job_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wait_training_complete:custom"
      },
      "source": [
        "# Deploy the model\n",
        "\n",
        "Training the above model may take upwards of 20 minutes time.\n",
        "\n",
        "Once your model is done training, you can calculate the actual time it took to train the model by subtracting `end_time` from `start_time`. For your model, we will need to know the location of the saved model, which the Python script saved in your local Cloud Storage bucket at `MODEL_DIR + '/saved_model.pb'`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wait_training_complete:custom"
      },
      "outputs": [],
      "source": [
        "while True:\n",
        "    response = get_custom_job(job_id, True)\n",
        "    if response.state != aip.JobState.JOB_STATE_SUCCEEDED:\n",
        "        print(\"Training job has not completed:\", response.state)\n",
        "        model_path_to_deploy = None\n",
        "        if response.state == aip.JobState.JOB_STATE_FAILED:\n",
        "            break\n",
        "    else:\n",
        "        if not DIRECT:\n",
        "            MODEL_DIR = MODEL_DIR + \"/model\"\n",
        "        model_path_to_deploy = MODEL_DIR\n",
        "        print(\"Training Job Time:\", response.end_time - response.start_time)\n",
        "        print(\"Training Elapsed Time:\", response.update_time - response.create_time)\n",
        "        break\n",
        "    time.sleep(60)\n",
        "\n",
        "print(\"model_to_deploy:\", model_path_to_deploy)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "load_saved_model"
      },
      "source": [
        "## Load the saved model\n",
        "\n",
        "Your model is stored in a TensorFlow SavedModel format in a Cloud Storage bucket. Now load it from the Cloud Storage bucket, and then you can do some things, like evaluate the model, and do a prediction.\n",
        "\n",
        "To load, you use the TF.Keras `model.load_model()` method passing it the Cloud Storage path where the model is saved -- specified by `MODEL_DIR`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "load_saved_model"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "\n",
        "model = tf.keras.models.load_model(MODEL_DIR)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "evaluate_custom_model:tabular"
      },
      "source": [
        "## Evaluate the model\n",
        "\n",
        "Now let's find out how good the model is.\n",
        "\n",
        "### Load evaluation data\n",
        "\n",
        "You will load the Boston Housing test (holdout) data from `tf.keras.datasets`, using the method `load_data()`. This returns the dataset as a tuple of two elements. The first element is the training data and the second is the test data. Each element is also a tuple of two elements: the feature data, and the corresponding labels (median value of owner-occupied home).\n",
        "\n",
        "You don't need the training data, and hence why we loaded it as `(_, _)`.\n",
        "\n",
        "Before you can run the data through evaluation, you need to preprocess it:\n",
        "\n",
        "`x_test`:\n",
        "1. Normalize (rescale) the data in each column by dividing each value by the maximum value of that column. This replaces each single value with a 32-bit floating point number between 0 and 1."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "evaluate_custom_model:tabular,boston"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "from tensorflow.keras.datasets import boston_housing\n",
        "\n",
        "(_, _), (x_test, y_test) = boston_housing.load_data(\n",
        "    path=\"boston_housing.npz\", test_split=0.2, seed=113\n",
        ")\n",
        "\n",
        "\n",
        "def scale(feature):\n",
        "    max = np.max(feature)\n",
        "    feature = (feature / max).astype(np.float32)\n",
        "    return feature\n",
        "\n",
        "\n",
        "# Let's save one data item that has not been scaled\n",
        "x_test_notscaled = x_test[0:1].copy()\n",
        "\n",
        "for _ in range(13):\n",
        "    x_test[_] = scale(x_test[_])\n",
        "x_test = x_test.astype(np.float32)\n",
        "\n",
        "print(x_test.shape, x_test.dtype, y_test.shape)\n",
        "print(\"scaled\", x_test[0])\n",
        "print(\"unscaled\", x_test_notscaled)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "perform_evaluation_custom"
      },
      "source": [
        "### Perform the model evaluation\n",
        "\n",
        "Now evaluate how well the model in the custom job did."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "perform_evaluation_custom"
      },
      "outputs": [],
      "source": [
        "model.evaluate(x_test, y_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "how_serving_function_works"
      },
      "source": [
        "## Upload the model for serving\n",
        "\n",
        "Next, you will upload your TF.Keras model from the custom job to Vertex `Model` service, which will create a Vertex `Model` resource for your custom model. During upload, you need to define a serving function to convert data to the format your model expects. If you send encoded data to Vertex AI, your serving function ensures that the data is decoded on the model server before it is passed as input to your model.\n",
        "\n",
        "### How does the serving function work\n",
        "\n",
        "When you send a request to an online prediction server, the request is received by a HTTP server. The HTTP server extracts the prediction request from the HTTP request content body. The extracted prediction request is forwarded to the serving function. For Google pre-built prediction containers, the request content is passed to the serving function as a `tf.string`.\n",
        "\n",
        "The serving function consists of two parts:\n",
        "\n",
        "- `preprocessing function`:\n",
        "  - Converts the input (`tf.string`) to the input shape and data type of the underlying model (dynamic graph).\n",
        "  - Performs the same preprocessing of the data that was done during training the underlying model, including normalizing and scalingh.\n",
        "- `post-processing function`:\n",
        "  - Converts the model output to format expected by the receiving application -- e.q., compresses the output.\n",
        "  - Packages the output for the the receiving application, including adding headings, and making JSON objects.\n",
        "\n",
        "Both the preprocessing and post-processing functions are converted to static graphs which are fused to the model. The output from the underlying model is passed to the post-processing function. The post-processing function passes the converted/packaged output back to the HTTP server. The HTTP server returns the output as the HTTP response content.\n",
        "\n",
        "One consideration you need to consider when building serving functions for TF.Keras models is that they run as static graphs. That means, you cannot use TF graph operations that require a dynamic graph. If you do, you will get an error during the compile of the serving function which will indicate that you are using an EagerTensor which is not supported."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "serving_function_signature:xai"
      },
      "source": [
        "## Get the serving function signature\n",
        "\n",
        "You can get the signatures of your model's input and output layers by reloading the model into memory, and querying it for the signatures corresponding to each layer.\n",
        "\n",
        "When making a prediction request, you need to route the request to the serving function instead of the model, so you need to know the input layer name of the serving function -- which you will use later when you make a prediction request.\n",
        "\n",
        "You also need to know the name of the serving function's input and output layer for constructing the explanation metadata -- which is discussed subsequently."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "serving_function_signature:xai"
      },
      "outputs": [],
      "source": [
        "loaded = tf.saved_model.load(model_path_to_deploy)\n",
        "\n",
        "serving_input = list(\n",
        "    loaded.signatures[\"serving_default\"].structured_input_signature[1].keys()\n",
        ")[0]\n",
        "print(\"Serving function input:\", serving_input)\n",
        "serving_output = list(loaded.signatures[\"serving_default\"].structured_outputs.keys())[0]\n",
        "print(\"Serving function output:\", serving_output)\n",
        "\n",
        "input_name = model.input.name\n",
        "print(\"Model input name:\", input_name)\n",
        "output_name = model.output.name\n",
        "print(\"Model output name:\", output_name)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "explanation_spec"
      },
      "source": [
        "### Explanation Specification\n",
        "\n",
        "To get explanations when doing a prediction, you must enable the explanation capability and set corresponding settings when you upload your custom model to a Vertex `Model` resource. These settings are referred to as the explanation metadata, which consists of:\n",
        "\n",
        "- `parameters`: This is the specification for the explainability algorithm to use for explanations on your model. You can choose between:\n",
        "  - Shapley - *Note*, not recommended for image data -- can be very long running\n",
        "  - XRAI\n",
        "  - Integrated Gradients\n",
        "- `metadata`: This is the specification for how the algoithm is applied on your custom model.\n",
        "\n",
        "#### Explanation Parameters\n",
        "\n",
        "Let's first dive deeper into the settings for the explainability algorithm.\n",
        "\n",
        "#### Shapley\n",
        "\n",
        "Assigns credit for the outcome to each feature, and considers different permutations of the features. This method provides a sampling approximation of exact Shapley values.\n",
        "\n",
        "Use Cases:\n",
        "  - Classification and regression on tabular data.\n",
        "\n",
        "Parameters:\n",
        "\n",
        "- `path_count`: This is the number of paths over the features that will be processed by the algorithm. An exact approximation of the Shapley values requires M! paths, where M is the number of features. For the CIFAR10 dataset, this would be 784 (28*28).\n",
        "\n",
        "For any non-trival number of features, this is too compute expensive. You can reduce the number of paths over the features to M * `path_count`.\n",
        "\n",
        "#### Integrated Gradients\n",
        "\n",
        "A gradients-based method to efficiently compute feature attributions with the same axiomatic properties as the Shapley value.\n",
        "\n",
        "Use Cases:\n",
        "  - Classification and regression on tabular data.\n",
        "  - Classification on image data.\n",
        "\n",
        "Parameters:\n",
        "\n",
        "- `step_count`: This is the number of steps to approximate the remaining sum. The more steps, the more accurate the integral approximation. The general rule of thumb is 50 steps, but as you increase so does the compute time.\n",
        "\n",
        "#### XRAI\n",
        "\n",
        "Based on the integrated gradients method, XRAI assesses overlapping regions of the image to create a saliency map, which highlights relevant regions of the image rather than pixels.\n",
        "\n",
        "Use Cases:\n",
        "\n",
        "  - Classification on image data.\n",
        "\n",
        "Parameters:\n",
        "\n",
        "- `step_count`: This is the number of steps to approximate the remaining sum. The more steps, the more accurate the integral approximation. The general rule of thumb is 50 steps, but as you increase so does the compute time.\n",
        "\n",
        "In the next code cell, set the variable `XAI` to which explainabilty algorithm you will use on your custom model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "explanation_spec"
      },
      "outputs": [],
      "source": [
        "XAI = \"ig\"  # [ shapley, ig, xrai ]\n",
        "\n",
        "if XAI == \"shapley\":\n",
        "    PARAMETERS = {\"sampled_shapley_attribution\": {\"path_count\": 10}}\n",
        "elif XAI == \"ig\":\n",
        "    PARAMETERS = {\"integrated_gradients_attribution\": {\"step_count\": 50}}\n",
        "elif XAI == \"xrai\":\n",
        "    PARAMETERS = {\"xrai_attribution\": {\"step_count\": 50}}\n",
        "\n",
        "parameters = aip.ExplanationParameters(PARAMETERS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "explanation_metadata:tabular"
      },
      "source": [
        "#### Explanation Metadata\n",
        "\n",
        "Let's first dive deeper into the explanation metadata, which consists of:\n",
        "\n",
        "- `outputs`: A scalar value in the output to attribute -- what to explain. For example, in a probability output \\[0.1, 0.2, 0.7\\] for classification, one wants an explanation for 0.7. Consider the following formulae, where the output is `y` and that is what we want to explain.\n",
        "\n",
        "    y = f(x)\n",
        "\n",
        "Consider the following formulae, where the outputs are `y` and `z`. Since we can only do attribution for one scalar value, we have to pick whether we want to explain the output `y` or `z`. Assume in this example the model is object detection and y and z are the bounding box and the object classification. You would want to pick which of the two outputs to explain.\n",
        "\n",
        "    y, z = f(x)\n",
        "\n",
        "The dictionary format for `outputs` is:\n",
        "\n",
        "    { \"outputs\": { \"[your_display_name]\":\n",
        "                   \"output_tensor_name\": [layer]\n",
        "                 }\n",
        "    }\n",
        "\n",
        "<div style=\"margin-left: 25px;\">\n",
        "    <ul>\n",
        "       <li>[your_display_name]: A human readable name you assign to the output to explain. A common example is \"probability\".</li>\n",
        "       <li>\"output_tensor_name\": The key/value field to identify the output layer to explain. </li>\n",
        "       <li>[layer]: The output layer to explain. In a single task model, like a tabular regressor, it is the last (topmost) layer in the model</li>.\n",
        "    </ul>\n",
        "</div>\n",
        "\n",
        "\n",
        "- `inputs`: The features for attribution -- how they contributed to the output. Consider the following formulae, where `a` and `b` are the features. We have to pick which features to explain how the contributed. Assume that this model is deployed for A/B testing, where `a` are the data_items for the prediction and `b` identifies whether the model instance is A or B. You would want to pick `a` (or some subset of) for the features, and not `b` since it does not contribute to the prediction.\n",
        "\n",
        "    y = f(a,b)\n",
        "\n",
        "The minimum dictionary format for `inputs` is:\n",
        "\n",
        "    { \"inputs\": { \"[your_display_name]\":\n",
        "                  \"input_tensor_name\": [layer]\n",
        "                 }\n",
        "    }\n",
        "\n",
        "<div style=\"margin-left: 25px;\">\n",
        "    <ul>\n",
        "       <li>[your_display_name]: A human readable name you assign to the input to explain. A common example is \"features\".</li>\n",
        "       <li>\"input_tensor_name\": The key/value field to identify the input layer for the feature attribution. </li>\n",
        "       <li>[layer]: The input layer for feature attribution. In a single input tensor model, it is the first (bottom-most) layer in the model.</li>\n",
        "    </ul>\n",
        "</div>\n",
        "\n",
        "Since the inputs to the model are tabular, you can specify the following two additional fields as reporting/visualization aids:\n",
        "\n",
        "<div style=\"margin-left: 25px;\">\n",
        "    <ul>\n",
        "       <li>\"encoding\": \"BAG_OF_FEATURES\" : Indicates that the inputs are set of tabular features.</li>\n",
        "       <li>\"index_feature_mapping\": [ feature-names ] : A list of human readable names for each feature. For this example, we use the feature names specified in the dataset.</li>\n",
        "       <li>\"modality\": \"numeric\": Indicates the field values are numeric.</li>\n",
        "    </ul>\n",
        "</div>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "explanation_metadata:tabular"
      },
      "outputs": [],
      "source": [
        "INPUT_METADATA = {\n",
        "    \"input_tensor_name\": serving_input,\n",
        "    \"encoding\": \"BAG_OF_FEATURES\",\n",
        "    \"modality\": \"numeric\",\n",
        "    \"index_feature_mapping\": [\n",
        "        \"crim\",\n",
        "        \"zn\",\n",
        "        \"indus\",\n",
        "        \"chas\",\n",
        "        \"nox\",\n",
        "        \"rm\",\n",
        "        \"age\",\n",
        "        \"dis\",\n",
        "        \"rad\",\n",
        "        \"tax\",\n",
        "        \"ptratio\",\n",
        "        \"b\",\n",
        "        \"lstat\",\n",
        "    ],\n",
        "}\n",
        "\n",
        "OUTPUT_METADATA = {\"output_tensor_name\": serving_output}\n",
        "\n",
        "input_metadata = aip.ExplanationMetadata.InputMetadata(INPUT_METADATA)\n",
        "output_metadata = aip.ExplanationMetadata.OutputMetadata(OUTPUT_METADATA)\n",
        "\n",
        "metadata = aip.ExplanationMetadata(\n",
        "    inputs={\"features\": input_metadata}, outputs={\"medv\": output_metadata}\n",
        ")\n",
        "\n",
        "explanation_spec = aip.ExplanationSpec(metadata=metadata, parameters=parameters)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "upload_the_model:explanation"
      },
      "source": [
        "### Upload the model\n",
        "\n",
        "Use this helper function `upload_model` to upload your model, stored in SavedModel format, up to the `Model` service, which will instantiate a Vertex `Model` resource instance for your model. Once you've done that, you can use the `Model` resource instance in the same way as any other Vertex `Model` resource instance, such as deploying to an `Endpoint` resource for serving predictions.\n",
        "\n",
        "The helper function takes the following parameters:\n",
        "\n",
        "- `display_name`: A human readable name for the `Endpoint` service.\n",
        "- `image_uri`: The container image for the model deployment.\n",
        "- `model_uri`: The Cloud Storage path to our SavedModel artificat. For this tutorial, this is the Cloud Storage location where the `trainer/task.py` saved the model artifacts, which we specified in the variable `MODEL_DIR`.\n",
        "\n",
        "The helper function calls the `Model` client service's method `upload_model`, which takes the following parameters:\n",
        "\n",
        "- `parent`: The Vertex AI location root path for `Dataset`, `Model` and `Endpoint` resources.\n",
        "- `model`: The specification for the Vertex `Model` resource instance.\n",
        "\n",
        "Let's now dive deeper into the Vertex model specification `model`. This is a dictionary object that consists of the following fields:\n",
        "\n",
        "- `display_name`: A human readable name for the `Model` resource.\n",
        "- `metadata_schema_uri`: Since your model was built without an Vertex `Dataset` resource, you will leave this blank (`''`).\n",
        "- `artifact_uri`: The Cloud Storage path where the model is stored in SavedModel format.\n",
        "- `container_spec`: This is the specification for the Docker container that will be installed on the `Endpoint` resource, from which the `Model` resource will serve predictions. Use the variable you set earlier `DEPLOY_GPU != None` to use a GPU; otherwise only a CPU is allocated.\n",
        "- `explanation_spec`: This is the specification for enabling explainability for your model.\n",
        "\n",
        "Uploading a model into a Vertex `Model` resource returns a long running operation, since it may take a few moments. You call response.result(), which is a synchronous call and will return when the Vertex Model resource is ready.\n",
        "\n",
        "The helper function returns the Vertex AI fully qualified identifier for the corresponding Vertex `Model` instance upload_model_response.model. You will save the identifier for subsequent steps in the variable model_to_deploy_id."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "upload_the_model:explanation"
      },
      "outputs": [],
      "source": [
        "IMAGE_URI = DEPLOY_IMAGE\n",
        "\n",
        "\n",
        "def upload_model(display_name, image_uri, model_uri):\n",
        "\n",
        "    model = aip.Model(\n",
        "        display_name=display_name,\n",
        "        artifact_uri=model_uri,\n",
        "        metadata_schema_uri=\"\",\n",
        "        explanation_spec=explanation_spec,\n",
        "        container_spec={\"image_uri\": image_uri},\n",
        "    )\n",
        "\n",
        "    response = clients[\"model\"].upload_model(parent=PARENT, model=model)\n",
        "    print(\"Long running operation:\", response.operation.name)\n",
        "    upload_model_response = response.result(timeout=180)\n",
        "    print(\"upload_model_response\")\n",
        "    print(\" model:\", upload_model_response.model)\n",
        "    return upload_model_response.model\n",
        "\n",
        "\n",
        "model_to_deploy_id = upload_model(\n",
        "    \"boston-\" + TIMESTAMP, IMAGE_URI, model_path_to_deploy\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "get_model"
      },
      "source": [
        "### Get `Model` resource information\n",
        "\n",
        "Now let's get the model information for just your model. Use this helper function `get_model`, with the following parameter:\n",
        "\n",
        "- `name`: The Vertex AI unique identifier for the `Model` resource.\n",
        "\n",
        "This helper function calls the Vertex AI `Model` client service's method `get_model`, with the following parameter:\n",
        "\n",
        "- `name`: The Vertex AI unique identifier for the `Model` resource."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "get_model"
      },
      "outputs": [],
      "source": [
        "def get_model(name):\n",
        "    response = clients[\"model\"].get_model(name=name)\n",
        "    print(response)\n",
        "\n",
        "\n",
        "get_model(model_to_deploy_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "deploy:batch_prediction"
      },
      "source": [
        "## Model deployment for batch prediction\n",
        "\n",
        "Now deploy the trained Vertex `Model` resource you created for batch prediction. This differs from deploying a `Model` resource for online prediction.\n",
        "\n",
        "For online prediction, you:\n",
        "\n",
        "1. Create an `Endpoint` resource for deploying the `Model` resource to.\n",
        "\n",
        "2. Deploy the `Model` resource to the `Endpoint` resource.\n",
        "\n",
        "3. Make online prediction requests to the `Endpoint` resource.\n",
        "\n",
        "For batch-prediction, you:\n",
        "\n",
        "1. Create a batch prediction job.\n",
        "\n",
        "2. The job service will provision resources for the batch prediction request.\n",
        "\n",
        "3. The results of the batch prediction request are returned to the caller.\n",
        "\n",
        "4. The job service will unprovision the resoures for the batch prediction request."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "make_prediction"
      },
      "source": [
        "## Send a batch prediction request\n",
        "\n",
        "Send a batch prediction to your deployed model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "get_test_items:test,tabular"
      },
      "outputs": [],
      "source": [
        "test_item_1 = x_test[0]\n",
        "test_label_1 = y_test[0]\n",
        "test_item_2 = x_test[1]\n",
        "test_label_2 = y_test[1]\n",
        "print(test_item_1.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "make_batch_file:custom,tabular"
      },
      "source": [
        "### Make the batch input file\n",
        "\n",
        "Now make a batch input file, which you will store in your local Cloud Storage bucket.  Each instance in the prediction request is a dictionary entry of the form:\n",
        "\n",
        "                        {serving_input: content}\n",
        "\n",
        "- `serving_input`: the name of the input layer of the underlying model.\n",
        "- `content`: The feature values of the test item as a list."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "make_batch_file:custom,tabular"
      },
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "gcs_input_uri = BUCKET_NAME + \"/\" + \"test.jsonl\"\n",
        "with tf.io.gfile.GFile(gcs_input_uri, \"w\") as f:\n",
        "    data = {serving_input: test_item_1.tolist()}\n",
        "    f.write(json.dumps(data) + \"\\n\")\n",
        "    data = {serving_input: test_item_2.tolist()}\n",
        "    f.write(json.dumps(data) + \"\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "instance_scaling"
      },
      "source": [
        "### Compute instance scaling\n",
        "\n",
        "You have several choices on scaling the compute instances for handling your batch prediction requests:\n",
        "\n",
        "- Single Instance: The batch prediction requests are processed on a single compute instance.\n",
        "  - Set the minimum (`MIN_NODES`) and maximum (`MAX_NODES`) number of compute instances to one.\n",
        "\n",
        "- Manual Scaling: The batch prediction requests are split across a fixed number of compute instances that you manually specified.\n",
        "  - Set the minimum (`MIN_NODES`) and maximum (`MAX_NODES`) number of compute instances to the same number of nodes. When a model is first deployed to the instance, the fixed number of compute instances are provisioned and batch prediction requests are evenly distributed across them.\n",
        "\n",
        "- Auto Scaling: The batch prediction requests are split across a scaleable number of compute instances.\n",
        "  - Set the minimum (`MIN_NODES`) number of compute instances to provision when a model is first deployed and to de-provision, and set the maximum (`MAX_NODES) number of compute instances to provision, depending on load conditions.\n",
        "\n",
        "The minimum number of compute instances corresponds to the field `min_replica_count` and the maximum number of compute instances corresponds to the field `max_replica_count`, in your subsequent deployment request."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "instance_scaling"
      },
      "outputs": [],
      "source": [
        "MIN_NODES = 1\n",
        "MAX_NODES = 1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "make_batch_request:custom"
      },
      "source": [
        "### Make batch prediction request\n",
        "\n",
        "Now that your batch of two test items is ready, let's do the batch request. Use this helper function `create_batch_prediction_job`, with the following parameters:\n",
        "\n",
        "- `display_name`: The human readable name for the prediction job.\n",
        "- `model_name`: The Vertex AI fully qualified identifier for the `Model` resource.\n",
        "- `gcs_source_uri`: The Cloud Storage path to the input file -- which you created above.\n",
        "- `gcs_destination_output_uri_prefix`: The Cloud Storage path that the service will write the predictions to.\n",
        "- `parameters`: Additional filtering parameters for serving prediction results.\n",
        "\n",
        "The helper function calls the job client service's `create_batch_prediction_job` metho, with the following parameters:\n",
        "\n",
        "- `parent`: The Vertex AI location root path for Dataset, Model and Pipeline resources.\n",
        "- `batch_prediction_job`: The specification for the batch prediction job.\n",
        "\n",
        "Let's now dive into the specification for the `batch_prediction_job`:\n",
        "\n",
        "- `display_name`: The human readable name for the prediction batch job.\n",
        "- `model`: The Vertex AI fully qualified identifier for the `Model` resource.\n",
        "- `dedicated_resources`: The compute resources to provision for the batch prediction job.\n",
        "  - `machine_spec`: The compute instance to provision. Use the variable you set earlier `DEPLOY_GPU != None` to use a GPU; otherwise only a CPU is allocated.\n",
        "  - `starting_replica_count`: The number of compute instances to initially provision, which you set earlier as the variable `MIN_NODES`.\n",
        "  - `max_replica_count`: The maximum number of compute instances to scale to, which you set earlier as the variable `MAX_NODES`.\n",
        "- `model_parameters`: Additional filtering parameters for serving prediction results. No Additional parameters are supported for custom models.\n",
        "- `input_config`: The input source and format type for the instances to predict.\n",
        " - `instances_format`: The format of the batch prediction request file: `csv` or `jsonl`.\n",
        " - `gcs_source`: A list of one or more Cloud Storage paths to your batch prediction requests.\n",
        "- `output_config`: The output destination and format for the predictions.\n",
        " - `prediction_format`: The format of the batch prediction response file: `csv` or `jsonl`.\n",
        " - `gcs_destination`: The output destination for the predictions.\n",
        "\n",
        "This call is an asychronous operation. You will print from the response object a few select fields, including:\n",
        "\n",
        "- `name`: The Vertex AI fully qualified identifier assigned to the batch prediction job.\n",
        "- `display_name`: The human readable name for the prediction batch job.\n",
        "- `model`: The Vertex AI fully qualified identifier for the Model resource.\n",
        "- `generate_explanations`: Whether True/False explanations were provided with the predictions (explainability).\n",
        "- `state`: The state of the prediction job (pending, running, etc).\n",
        "\n",
        "Since this call will take a few moments to execute, you will likely get `JobState.JOB_STATE_PENDING` for `state`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "make_batch_request:custom"
      },
      "outputs": [],
      "source": [
        "BATCH_MODEL = \"boston_batch-\" + TIMESTAMP\n",
        "\n",
        "\n",
        "def create_batch_prediction_job(\n",
        "    display_name,\n",
        "    model_name,\n",
        "    gcs_source_uri,\n",
        "    gcs_destination_output_uri_prefix,\n",
        "    parameters=None,\n",
        "):\n",
        "\n",
        "    if DEPLOY_GPU:\n",
        "        machine_spec = {\n",
        "            \"machine_type\": DEPLOY_COMPUTE,\n",
        "            \"accelerator_type\": DEPLOY_GPU,\n",
        "            \"accelerator_count\": DEPLOY_NGPU,\n",
        "        }\n",
        "    else:\n",
        "        machine_spec = {\n",
        "            \"machine_type\": DEPLOY_COMPUTE,\n",
        "            \"accelerator_count\": 0,\n",
        "        }\n",
        "\n",
        "    batch_prediction_job = {\n",
        "        \"display_name\": display_name,\n",
        "        # Format: 'projects/{project}/locations/{location}/models/{model_id}'\n",
        "        \"model\": model_name,\n",
        "        \"model_parameters\": json_format.ParseDict(parameters, Value()),\n",
        "        \"input_config\": {\n",
        "            \"instances_format\": IN_FORMAT,\n",
        "            \"gcs_source\": {\"uris\": [gcs_source_uri]},\n",
        "        },\n",
        "        \"output_config\": {\n",
        "            \"predictions_format\": OUT_FORMAT,\n",
        "            \"gcs_destination\": {\"output_uri_prefix\": gcs_destination_output_uri_prefix},\n",
        "        },\n",
        "        \"dedicated_resources\": {\n",
        "            \"machine_spec\": machine_spec,\n",
        "            \"starting_replica_count\": MIN_NODES,\n",
        "            \"max_replica_count\": MAX_NODES,\n",
        "        },\n",
        "        \"generate_explanation\": True,\n",
        "    }\n",
        "    response = clients[\"job\"].create_batch_prediction_job(\n",
        "        parent=PARENT, batch_prediction_job=batch_prediction_job\n",
        "    )\n",
        "    print(\"response\")\n",
        "    print(\" name:\", response.name)\n",
        "    print(\" display_name:\", response.display_name)\n",
        "    print(\" model:\", response.model)\n",
        "    try:\n",
        "        print(\" generate_explanation:\", response.generate_explanation)\n",
        "    except:\n",
        "        pass\n",
        "    print(\" state:\", response.state)\n",
        "    print(\" create_time:\", response.create_time)\n",
        "    print(\" start_time:\", response.start_time)\n",
        "    print(\" end_time:\", response.end_time)\n",
        "    print(\" update_time:\", response.update_time)\n",
        "    print(\" labels:\", response.labels)\n",
        "    return response\n",
        "\n",
        "\n",
        "IN_FORMAT = \"jsonl\"\n",
        "OUT_FORMAT = \"jsonl\"\n",
        "\n",
        "response = create_batch_prediction_job(\n",
        "    BATCH_MODEL, model_to_deploy_id, gcs_input_uri, BUCKET_NAME\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "batch_job_id:response"
      },
      "source": [
        "Now get the unique identifier for the batch prediction job you created."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "batch_job_id:response"
      },
      "outputs": [],
      "source": [
        "# The full unique ID for the batch job\n",
        "batch_job_id = response.name\n",
        "# The short numeric ID for the batch job\n",
        "batch_job_short_id = batch_job_id.split(\"/\")[-1]\n",
        "\n",
        "print(batch_job_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "get_batch_prediction_job"
      },
      "source": [
        "### Get information on a batch prediction job\n",
        "\n",
        "Use this helper function `get_batch_prediction_job`, with the following paramter:\n",
        "\n",
        "- `job_name`: The Vertex AI fully qualified identifier for the batch prediction job.\n",
        "\n",
        "The helper function calls the job client service's `get_batch_prediction_job` method, with the following paramter:\n",
        "\n",
        "- `name`: The Vertex AI fully qualified identifier for the batch prediction job. In this tutorial, you will pass it the Vertex fully qualified identifier for your batch prediction job -- `batch_job_id`\n",
        "\n",
        "The helper function will return the Cloud Storage path to where the predictions are stored -- `gcs_destination`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "get_batch_prediction_job"
      },
      "outputs": [],
      "source": [
        "def get_batch_prediction_job(job_name, silent=False):\n",
        "    response = clients[\"job\"].get_batch_prediction_job(name=job_name)\n",
        "    if silent:\n",
        "        return response.output_config.gcs_destination.output_uri_prefix, response.state\n",
        "\n",
        "    print(\"response\")\n",
        "    print(\" name:\", response.name)\n",
        "    print(\" display_name:\", response.display_name)\n",
        "    print(\" model:\", response.model)\n",
        "    try:  # not all data types support explanations\n",
        "        print(\" generate_explanation:\", response.generate_explanation)\n",
        "    except:\n",
        "        pass\n",
        "    print(\" state:\", response.state)\n",
        "    print(\" error:\", response.error)\n",
        "    gcs_destination = response.output_config.gcs_destination\n",
        "    print(\" gcs_destination\")\n",
        "    print(\"  output_uri_prefix:\", gcs_destination.output_uri_prefix)\n",
        "    return gcs_destination.output_uri_prefix, response.state\n",
        "\n",
        "\n",
        "predictions, state = get_batch_prediction_job(batch_job_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "get_the_explanations:custom,tabular"
      },
      "source": [
        "### Get the predictions\n",
        "\n",
        "When the batch prediction is done processing, the job state will be `JOB_STATE_SUCCEEDED`.\n",
        "\n",
        "Finally you view the predictions stored at the Cloud Storage path you set as output. The predictions will be in a JSONL format, which you indicated at the time you made the batch prediction job. The predictions are in a subdirectory starting with the name `prediction`. Within that subdirectory there is a file named `prediction.results-xxxxx-of-xxxxx`.\n",
        "\n",
        "Now display (cat) the contents. You will see multiple JSON objects, one for each prediction.\n",
        "\n",
        "Finally you view the explanations stored at the Cloud Storage path you set as output. The explanations will be in a JSONL format, which you indicated at the time you made the batch explanation job. The explanations are in a subdirectory starting with the name `prediction`. Within that subdirectory there is a file named `explanations.results-xxxxx-of-xxxxx`.\n",
        "\n",
        "Let's display (cat) the contents. You will a row for each prediction -- in this case, there is just one row. The row contains:\n",
        "\n",
        "- `dense_input`: The input for the prediction.\n",
        "- `prediction`: The predicted value."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "get_the_explanations:custom,tabular"
      },
      "outputs": [],
      "source": [
        "def get_latest_predictions(gcs_out_dir):\n",
        "    \"\"\" Get the latest prediction subfolder using the timestamp in the subfolder name\"\"\"\n",
        "    folders = !gsutil ls $gcs_out_dir\n",
        "    latest = \"\"\n",
        "    for folder in folders:\n",
        "        subfolder = folder.split(\"/\")[-2]\n",
        "        if subfolder.startswith(\"prediction-\"):\n",
        "            if subfolder > latest:\n",
        "                latest = folder[:-1]\n",
        "    return latest\n",
        "\n",
        "\n",
        "while True:\n",
        "    predictions, state = get_batch_prediction_job(batch_job_id, True)\n",
        "    if state != aip.JobState.JOB_STATE_SUCCEEDED:\n",
        "        print(\"The job has not completed:\", state)\n",
        "        if state == aip.JobState.JOB_STATE_FAILED:\n",
        "            break\n",
        "    else:\n",
        "        folder = get_latest_predictions(predictions)\n",
        "        ! gsutil ls $folder/explanation.results*\n",
        "\n",
        "        print(\"Results:\")\n",
        "        ! gsutil cat $folder/explanation.results*\n",
        "\n",
        "        print(\"Errors:\")\n",
        "        ! gsutil cat $folder/prediction.errors*\n",
        "        break\n",
        "    time.sleep(60)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cleanup"
      },
      "source": [
        "# Cleaning up\n",
        "\n",
        "To clean up all Google Cloud resources used in this project, you can [delete the GCP\n",
        "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n",
        "\n",
        "Otherwise, you can delete the individual resources you created in this tutorial:\n",
        "\n",
        "- Dataset\n",
        "- Pipeline\n",
        "- Model\n",
        "- Endpoint\n",
        "- Batch Job\n",
        "- Custom Job\n",
        "- Hyperparameter Tuning Job\n",
        "- Cloud Storage Bucket"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cleanup"
      },
      "outputs": [],
      "source": [
        "delete_dataset = True\n",
        "delete_pipeline = True\n",
        "delete_model = True\n",
        "delete_endpoint = True\n",
        "delete_batchjob = True\n",
        "delete_customjob = True\n",
        "delete_hptjob = True\n",
        "delete_bucket = True\n",
        "\n",
        "# Delete the dataset using the Vertex fully qualified identifier for the dataset\n",
        "try:\n",
        "    if delete_dataset and \"dataset_id\" in globals():\n",
        "        clients[\"dataset\"].delete_dataset(name=dataset_id)\n",
        "except Exception as e:\n",
        "    print(e)\n",
        "\n",
        "# Delete the training pipeline using the Vertex fully qualified identifier for the pipeline\n",
        "try:\n",
        "    if delete_pipeline and \"pipeline_id\" in globals():\n",
        "        clients[\"pipeline\"].delete_training_pipeline(name=pipeline_id)\n",
        "except Exception as e:\n",
        "    print(e)\n",
        "\n",
        "# Delete the model using the Vertex fully qualified identifier for the model\n",
        "try:\n",
        "    if delete_model and \"model_to_deploy_id\" in globals():\n",
        "        clients[\"model\"].delete_model(name=model_to_deploy_id)\n",
        "except Exception as e:\n",
        "    print(e)\n",
        "\n",
        "# Delete the endpoint using the Vertex fully qualified identifier for the endpoint\n",
        "try:\n",
        "    if delete_endpoint and \"endpoint_id\" in globals():\n",
        "        clients[\"endpoint\"].delete_endpoint(name=endpoint_id)\n",
        "except Exception as e:\n",
        "    print(e)\n",
        "\n",
        "# Delete the batch job using the Vertex fully qualified identifier for the batch job\n",
        "try:\n",
        "    if delete_batchjob and \"batch_job_id\" in globals():\n",
        "        clients[\"job\"].delete_batch_prediction_job(name=batch_job_id)\n",
        "except Exception as e:\n",
        "    print(e)\n",
        "\n",
        "# Delete the custom job using the Vertex fully qualified identifier for the custom job\n",
        "try:\n",
        "    if delete_customjob and \"job_id\" in globals():\n",
        "        clients[\"job\"].delete_custom_job(name=job_id)\n",
        "except Exception as e:\n",
        "    print(e)\n",
        "\n",
        "# Delete the hyperparameter tuning job using the Vertex fully qualified identifier for the hyperparameter tuning job\n",
        "try:\n",
        "    if delete_hptjob and \"hpt_job_id\" in globals():\n",
        "        clients[\"job\"].delete_hyperparameter_tuning_job(name=hpt_job_id)\n",
        "except Exception as e:\n",
        "    print(e)\n",
        "\n",
        "if delete_bucket and \"BUCKET_NAME\" in globals():\n",
        "    ! gsutil rm -r $BUCKET_NAME"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "gapic-custom_tabular_regression_batch_explain.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
